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Abstract: Wc analyse the properties of the Principal Fitted Components 
qq i (PFC) algorithm proposed by Cook. We derive theoretical properties of the 

resulting estimators, including sufficient conditions under which they are 
Y^-consistent, and explain some of the simulation results given in Cook's 
paper. We use techniques from random matrix theory and perturbation 
' theory. We argue that, under Cook's model at least, the PFC algorithm 

' should outperform the Principal Components algorithm. 
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d ' 1. Introduction and notation 



Cook [1] gives an extensive account of the history and controversy, dating back 
at least to the work of Fisher [4] , surrounding the question of whether the selec- 
tion of regression variables should be based only on values of the predictor, or 
whether the response should also be taken into account. For example, reduction 
by Principal Components (PC), introduced by Hotelling [7] and reviewed by for 
example Seber [12], only uses the sample covariance matrix of the predictors to 
select the variables, ignoring the values of the response. 

Cook [1] argues that, while PC can play a useful role in regression problems, 
the analysis can also take account of the random response Y, and introduces 
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the Principal Fitted Components (PFC) algorithm to do this. This algorithm 
performs a principal components analysis on the fitted sample covariance matrix 
obtained by projecting the predictor onto f y , a function of the response. Efficient 
selection of variables can allow regression algorithms to operate in a space of 
reduced dimension, making the resulting estimates potentially more accurate. 

In this paper, we analyse the properties of the PFC algorithm, in the context 
of a model given by Equation (5) of Cook [I], following the notation of [1] 
throughout. Cook converts the more familiar forward linear model Y = X/3 + e 
into an equivalent inverse regression form: 

Example 1.1. Consider a sample of n independent observations ~X. y in W 
indexed by the responses y and generated as 

Xy = H + TfBty + tie. (1) 

Here, fi and each X y are p x 1 matrices (column vectors), T is a full rank 
p x d matrix (with d < p), (3 is a d x r matrix (with d < r) and f y is an r x 1 
matrix. The parameter a gives the scale factor of e, apxl matrix of errors. The 
entries of e will often (but not always ) be assumed to be independent standard 
normals. We will estimate the span of the columns of the rank d matrix T , 
where by assumption, T T T = Id (we can reparameterize the model to achieve 
this). Having estimated T, we can work in a space of smaller dimension, using 
whatever techniques are appropriate. 

The fj, is a vector-valued function of the random response Y, and for sim- 
plicity we assume throughout that J2 y = 0- Further we assume that e is 
independent of Y, and hence of i y . (Note that we use different symbols, e and 
e, for the error terms in Equation (1) and the forward regression Y = X/3+e. Wc 
do not claim that e is independent of Y) . The f y can be constructed in a variety 
of ways, depending on the exact form of the data. For example, Cook mentions 
that if the conditional mean E(X|y = y) can be modelled by a polynomial of 
degree r, then it is appropriate to take i y = (y — mi, y 2 — m 2 , . . . , y r — m r ) T , 
where m u is the sample mean of the uth power of y. Alternatively, in the spirit 
of the Sliced Inverse Regression algorithm of Li [10], f y can be constructed by 
slicing the range of Y into (r + 1) disjoint bins. 

For ease of calculation we convert the model from Example 1.1 into matrix 
form. We write X for the nxp centred matrix of predictors, with rows (Xj, — X) T 
(where X is the sample mean of X), write F for the n x r matrix with rows i y , 
and E for the nxp error matrix with rows ae T . Given a full rank matrix G 
we write Pg = G(G T G) _1 G T for the matrix which projects orthogonally onto 
the span of the columns of G. 

Definition 1.2. Cook [1] defines the fitted matrix of predictors X = PpX and 
proposes the PFC algorithm; an estimate Tppc of T is given by the set of d 
eigenvectors of the fitted sample covariance matrix X T X which correspond to 
the largest eigenvalues. 

Cook contrasts this with the PC algorithm of Hotelling [7] , which performs 
the corresponding calculation for the sample covariance matrix X T X. That is, 
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an estimate Tpc of T is given by the set of d eigenvectors of X T X corresponding 
to the largest eigenvalues. Note that we refer to Tppc and Tpc as estimates of 
r for the sake of brevity; in fact, the span of the columns of Tppc and Tpc 
form estimates of the span of the columns of T. 

Theorem 2.4 of this paper gives a distributional result for the accuracy of 
PFC estimation. This allows us to explain the simulations relating to PFC 
estimation in Section 5 of Cook's paper [1], and to bound confidence intervals 
for the accuracy of the estimation, sec Theorem 3.3 below. In Section 3 we 
give sufficient conditions for Tppc to be a y^-consistent estimator of T, in the 
case d = r. In Section 4, we use results from perturbation and random matrix 
theory to consider the more general case d < r and consider the order of the 
errors that arise. In Theorem 4.3, we give sufficient conditions for Tpfc to be a 
Y^-consistent estimator of T in this more general case. 

Model (1) represents a problem in dimension reduction. The PC model is 
given as Equation (2) of [1] as 

Xj, = [i + Vu y + ere. (2) 

where rfx 1 vector u y satisfies u y = 0- Hence the model (1) is a special case 
of (2), where the (3f y replaces v y . As Cook remarks, in the PFC model (1) we 
aim to estimate /3, which contains dr parameters, whereas the PC model (2) 
contains the (n— l)d parameters of u y . Hence, the PFC model has the attractive 
feature that the number of parameters to be estimated does not grow with n. 

In Section 5 we argue that the PFC algorithm should perform strictly better 
than the PC algorithm, if the errors e are normally distributed. Specifically, 
Lemma 5.1 shows that in this case the sample covariance matrix X T X is the 
fitted covariance matrix X T X perturbed by random noise independent of X 
and r. Hence inference about T using X (PC algorithm) will be necessarily 
less accurate than inference using X (PFC algorithm). Proposition 5.4 gives 
bounds in certain parameter regimes that imply that the PC estimate Tpc is 
also Op(l/ y/n), approximately explaining simulations relating to PC estimation 
in Section 5 of [1]. All the proofs of this paper are presented in Appendix A. 

It is of course important to consider the validity of the model given by Equa- 
tion (1), in order to understand the significance of these results. The key is the 
reversal of the conditional distribution for Y|X implied by the linear model 
Y = X/3 + e to give a conditional distribution for X|(Y = y), as in Equa- 
tion (2). Such a reversal is not new, arising for example in the work of Oman 
[11]. If the X and e are each multivariate normal, then (X, Y) will be jointly 
multivariate normal, and we can move between Y = X/3 + e and Equation (2) 
by parameterizing appropriately. The question of the validity of the full PFC 
model Equation (1) amounts to the issue of how accurately the /3f y models v y . 
Presumably for real data there will be a trade-off between improved accuracy 
arising from PFC estimation and errors introduced by f3t y not adequately ex- 
plaining u y . Implicitly, model (2) requires that the X should be random and 
take continuous values. Hence X should come from measurements rather than 
designed experiments, and the case of factor variables is excluded. 
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In future work, we hope to explain the performance (in the sense of Mean 
Squared Error) of prediction errors shown in Figure 1(d) of Cook's paper [1], 
and to investigate the theoretical properties of the PFCpc algorithm described 
in Section 6 of [1]. 



2. Theoretical PFC performance when d — r 

In this section, we analyse the performance of the PFC algorithm in the case 
d = r. In Lemma 2.2 we give explicit expressions for the matrices involved, 
and define a matrix V which gives the span of the PFC directions. In Lemma 
2.3, we deduce distributional results for V, which we use to prove Theorem 2.4. 
Finally, we show how Theorem 2.4 explains some of the simulation results given 
in Section 5 of [1]. 

To measure the accuracy of our estimates, we consider the distribution of a 
normalised version of the quantity m(T,T) = ||(I P — Pr)r|| used, for example, 
by Xia, Tong, Li and Zhu [15]. We have a choice of which matrix norm || ■ || to 
use; we shall use the Frobcnius norm || ■ \\p defined by 

||A||| = tr (AA T ) = tr (A T A) = £ , 

k 

where is the kth column of A and | • | represents the vector norm. This choice 
of matrix norm has the attractive feature that ||A|| F = ||AP|| F for orthogonal 
P, so that we can make orthogonal changes of basis without affecting m(T, T). 



Definition 2.1. For true T and estimated value T, we define 

l|P r r||| 



c(T,r) 



|(i p -p r )r llF 



In the case d > 1, the quantity C(r,r) will change on rescaling columns of 
the matrix T by different amounts (that is, on multiplying T by diagonal D not 
proportional to I) . We do not have a priori estimates for the true scalings of the 
eigenvectors, so use the scalings that arise from orthogonal transformations of 
the columns of a matrix V arising from matrix factorization (see Lemma 2.2). 
Lemma 2.3 shows that this choice has the attractive feature that the entries of 
V have equal variance. There may exist better scalings in the sense of reduced 
average angle, however this choice already has good properties, as the theorems 
of this paper show. 

Lemma 2.2. Under model (1) from Example 1.1, the span of the r largest 
eigenvectors of X T X is identical to the span of columns of the p x r matrix V 
defined by 

V = T/3(F T F) 1/2 +E T F(F T F)~ 1/2 . (3) 

In particular, in the case where d = r these columns define the PFC space. Note 
that V can be found from known information and without calculating eigenval- 
ues, by V T = (F T F)- 1 /2 F Tx = (F T F)- 1 /2F T X. 
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The proofs of all the results of this paper are given in Appendix A. Next we 
give distributional results for V, by conditioning on the values of F. 

Lemma 2.3. Assume that the errors e have mean zero and are uncorrelated 
with variance 1. Conditioned on the values ofF, the entries of "V have mean 
T/3(F T F) 1 / 2 and variance a 2 , and are uncorrelated. 

Lemma 2.3 implies that if the errors e arc independent standard normals, 
then the columns of V arc independent multivariate normals, with means given 
by the columns of r/3(F T F) 1 / 2 and with covariance matrix a 2 I p . Hence the pxp 
matrix X T X = VV T has a non-central Wishart distribution with r degrees of 
freedom, scale parameter o~ 2 I p and non-centrality parameter Tf3(F T F)f3 T T T . 
This allows us to deduce a distributional result for C. 

Theorem 2.4. Under the model given by Equation (1), assume that the errors 
e are independent standard normals, and consider the PFC estimator Tppc- In 
the case d — r, conditional on (F T F) 7 the term 

(p-d)C(fp FC ,r) _ (p-d)||P r fpFc||| m 



d||(I p -Pr)r PFC 



2 



where -F r d,r(p-d)(A) denotes a non-central F distribution with (rd,r(p — d)) de- 
grees of freedom and non-centrality parameter given by the scaled trace A 
tr (f3(F T F)(3 T )/a 2 . 

The quantity C(T,T) measures the proportion of the magnitude of the es- 
timate r which lies in the span of the columns of T, and hence measures how 
good an estimate of the span of T is provided by T. In the case r = d = 1, 
this is compatible with Cook's plots of the angle 0(T,r) between true T and 
estimated T, in the sense that for any T and T the C(T, T) = cot 2 9(r, T). 

Section 6 of Li, Zha and Chiaromonte [9] introduces a different measure of 
similarity of subspaces as ||Pr — Pp||, where |j • || represents the operator norm. 

In the case d = r = 1, this again is compatible with the angle 8(T,r), in 
the sense that in this case ||Pr — Pp|| = | sin6(r,r)|. The use of the operator 
norm means that the measure of [9] represents 'worst case' performance, whereas 
using the Frobcnius norm gives 'average' performance. Our techniques do not 
at present give distributional results for ||Pr — Pf||, however, Frobcnius norms 
are typically easier to calculate than operator norms. 

Theorem 2.4 shows that the distribution of C(rpFc,r) does not depend on 
the value of F itself, helping to explain Cook's remark [1, P.10] that "the value 
of principal component estimators does not rest solely with the presence of 
collinearity" . Further, using Theorem 2.4, we can better understand the simu- 
lation graphs given in Section 5 of Cook [1]. 

Example 2.5. Figure 1 of [1] considers the model given by Equation (1) in 
the case where p = 10, d = r = \, (3 = \. Further, the errors e are nor- 
mally distributed and Y is normal with variance a\, so that F T F = ^2i =1 {yi — 
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y) 2 ~ "yXn-i- We consider the angle 0(rppc,r), where cot 2 0(rppc, T) = 

c(f PFC ,r). 

In Figure 1 we simulate directly from the distribution given by Theorem 2.4 
and vary parameters n, a and ay . Based on a sample of size 50, 000 for each set 
of parameter values, we plot the sample mean as o, along with upper and lower 
5% sample quantiles (plotted as A and +). The sample means shown here in 
Figure l(a)-l(c) fit closely with those in Figure l(a)-l(c) of [1]. 

3. y^-consistency of PFC estimates 

The quantiles plotted in Figure 1 give some idea of the spread of likely val- 
ues of the angle 0(TpFc,r). Theorem 2.4 shows that the squared cotangent 
C(rpFc,r) is a scalar multiple of a non-central F distribution with random 
non-centrality parameter, a complicated hierarchical form of mixture distribu- 
tion, meaning that it is not trivial to give confidence intervals for the angle 
9(rpFc,r) in closed form. 

In the case d = r we give probabilistic bounds in Theorem 3.3 demonstrating 
that 0(rpFC,r) decays like l/\/n (so Tppc is a -y/n-consistcnt estimator of 
r), for a more general class of error models than simply assuming normality. 
For simplicity of exposition, we restrict to the case where the distributions of 
e are symmetric, though this is not necessary for -y/n-consistency to hold - 
see Appendix A for details. First we give two technical results, Lemmas 3.1 and 
3.2, concerning the entries of the matrix V introduced in Lemma 2.2 and matrix 
/3(F T F)/3 T respectively. 

Lemma 3.1. If the errors e are independent and symmetric with variance 1 
and finite J^th moment 1TI4 then writing N = | (I — Pr)V|| 2 ? we know that 

EN = tr (f3(F T F)f3 T )+rda 2 and Var (N) < dT+Aahv (/3(F T F)/3 T ), (4) 

where T = o~ 4 r(m,4 — 1) does not depend on n. Similarly, writing D = ||PrV|||. 
we know that under the same conditions 

ED = r(p - d)a 2 and Var (D) < (p - d)T. (5) 

In the case where the errors e are standard normal, Lemma 3.1 simplifies, 
since the C^fc become independent and identically distributed normals. Thus 
D/a 2 is central % 2 with r(p—d) degrees of freedom. Similarly N/a 2 is non-central 
X 2 with rd degrees of freedom and non-centrality parameter tr (/3(F T F)/3 T )/er 2 , 
and T = 2rer 4 , with equality holding in the bounds on Var (TV) and Var (D) in 
Equations (4) and (5). 

We will write Ai(X) > A2(X) > ... > A P (X) for the ordered sequence of 
eigenvalues of a real symmetric p x p matrix X. 
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1 2 3 4 5 



(c) Anglo e(f PFC ,r) vs <t 

FlG 1. Simulation of angle 0(rppc,r) between Tppc an d T, in setting of Example 2.5. We 
plot the sample mean as o, and the upper and lower 5% sample quantiles as A and +■ (a) 
Angle vs n, with cry = a = 1; (b) Angle vs ay , with n = 40, a = 1; (c) Angle vs a, with 
n = 40, cry = 1. 
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Lemma 3.2. There exists a sequence (<f>i : 1 < i < d) such that given e > and 
S > 0, there exists n* = n*(S,e) such that the eigenvalues satisfy 

P U>i - 5 < Mff(jj_jtyj ) <<f, i + s for alii = l,...,d, and n > n*"j > 1- e. 

Theorem 3.3. In the case where d = r and errors e are independent and sym- 
metric with variance 1 and finite 4th moment, then we can construct confidence 
intervals such that 

p(e(f PFC ,r) > 9* (a)) < a and P (e(f PF c,r) < 91(a)) < a, 

where for any fixed a, the Q±(a) = 0(1/ \/n). 

Using Theorem 3.3, we can consider the case analysed in Example 2.5 and 
Figure 1, where /3 = 1, r = d = 1, p = 10 and the errors e and Y are normal. 
In this case, (F T F) ~ o-yXn-ii so we can take Ki — Oy ~ e an d ^2 = 0y + £• 
Equations (26) and (27) below show that confidence intervals for the angle 
9(rppc, r) decay asymptotically as c/^Jn, cjay or co respectively, as the other 
terms arc kept constant, as Figure 1 may suggest. 



4. Random matrices and perturbation 



In this section, we analyse the general case d < r of the model given by Equation 
(1) under the assumption that the errors e are Gaussian, using a perturbation 
argument. We first review some results from random matrix theory, which we 
use to deduce the order of 9(rpFc,r) in Theorem 4.3. 

Proposition 4.1. Write Xj,, for a u x v matrix with entries that are inde- 
pendent standard Gaussians, and consider the largest eigenvalue of the Wishart 
matrix X^Xj^. 

1. For a sequence of matrices X„ „, in the regime where v — ► oo and u v /v — > 
ft 

-Ai(X Mtil ,X^ ) — ► (1 + \fff) 2 almost surely. (6) 

v ' 

2. In the same regime 

Ai(X u „,„X^.J - (Vv - 1 + V^) 2 v „ 
►-fi, (I) 

where a uv = (\/v — 1 + \/u)(l/ \Jv — 1 + l/^/u) 1 ^ 3 , and where F\ is the 
Tracy-Widom law of order 1. 

3. For any u and v and for any t > 0, 



P(A 1 (X„,„X^J > (V^+V^ + t) 2 ) <cxp(-£ 2 /2). 



(8) 
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Part 1 of Proposition 4.1 is due to Geman [6], who does not require that 
the entries of the matrix X U I) be Gaussian (finiteness of moments of all orders 
will suffice). However, Equation (8) does require Gaussian entries, which is the 
reason for the assumption that e has Gaussian entries in the rest of this paper. 
Johnstone [8] proved Part 2, giving simulation results suggesting that this ap- 
proximation is accurate for u v ,v as small as 5. Part 3 is implied by Theorem 
11.13 of Davidson and Szarek [3]. 

We combine the random matrix results of Proposition 4.1 with a perturbation 
argument based on that given by Sibson [13] and used by Critchley [2] in a 
related context. As in Sibson [13], the perturbed eigenvectors are given in terms 
of generalized inverse matrices M + . If Mx = Ax, the linear map M + is defined 
using the property that 

\ if A = 0. 

To motivate the proof of -y^n-consistency in Theorem 4.3, we change basis to 
the orthogonal set {fc>W} used in the proof of Theorem 2.4. Equations (17) and 
(19) below mean that in this new basis, X T X = (U + (tS)(U + ctS) t . Here UU T 
has a dx d block of the form /3(F T F)/3 T with the remaining entries being zero, 
and S is a p x r matrix whose entries are independent standard Gaussians. 

Hence if a = then X T X has d positive eigenvalues with eigenvectors lying- 
in the space spanned by the columns of T, and the remaining eigenvalues are 
zero. Using perturbation theory, we bound how large a would have to be before 
one of the zero eigenvalues could become one of the d largest ones. 

Definition 4.2. For a ^ 0, we take B = UU T and L = <r(SU T + US T ) + 
<r 2 SS T , so that X T X = B + L. We define the first level crossing event 

£ a = {Ai(L) - A P (L) > \ d ([3(F T F)f3 T )}. 

Since Ai(UU T ) = for i > d + 1, results of Weyl (see Lemma A.l below) 
imply that 

Ai(X T X) > A d (/3(F T F)/3 T ) + A P (L) for i < d, 
A; (X T X) < Ai (L) for i > d + 1. 

Hence, if L\ does not occur, the d largest eigenvalues of X T X correspond to the 
perturbed values of the original eigenvalues of /3(F T F)/3 T . Proposition 4.1 gives 
probabilistic bounds on Ai(L) — A P (L), allowing us to control P(£i). 

Theorem 4.3. In the case d <r, if the errors e are independent standard nor- 
mals and the limiting matrix 4> = lim„^ 00 (/3(F T F)/3 T )/?i has distinct eigenval- 
ues, then there exist confidence intervals 



p(e(r PFC ,r)>e») <a, 

where for any fixed a, the Q*(a) = 0(1/ \/n). 
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5. Theoretical performance of the PC algorithm 

Finally wc discuss the PC algorithm, arguing in Lemma 5.1 that, in the case of 
normal errors e, it will give inferior performance to the PFC algorithm. Propo- 
sition 5.4 gives bounds on the performance of the PC algorithm, explaining the 
simulation results of Cook [1]. 

Lemma 5.1. Under the model given by Equation (1), if the errors e are nor- 
mally distributed, then we can write X T X = X T X + N , where N is independent 
ofX and T. 

This result allows us to argue that PFC estimation should out-perform PC 
estimation, in several senses. Firstly, inference about random variable X through 
random variable Y is better (in the sense of Minimum Mean Squared Error) 
than inference about X through Y + N, if N is independent of X. This follows 
since the best estimates arc /(F) = E(X\Y) and g(Y + N) = E(X\Y + N) 
respectively. The fact that the MMSE is lower in the first case is equivalent to 
the fact that Eg(Y + N) 2 < Ef(Y) 2 , which follows by the conditional Jensen 
inequality. Similarly, the conditional entropy H(X\Y) < H(X\Y + N), showing 
that there is less uncertainty about X on learning Y than on learning Y + N. 

We use similar techniques to those in Section 4 to bound the PC angle 
9(rpc,r). We regard the term N as a perturbation of order a 2 of the fit- 
ted sample covariance matrix X T X, and thus regard Tpc as a perturbation of 

TpFC- 

Definition 5.2. Writing A^(X) for the ith ordered eigenvalue o/X, we define 
M = min ( Xi (X T X) - X l+1 (X T X)) (9) 

i<r V / 

for the minimum level spacing (this includes the spacing between zero and the 
non-zero eigenvalues, since A r +i(X T X) = 0). We define the second level crossing 
event by 

C 2 = |Ai((X-X) T (X-X)) > Afj . 

Lemma 5.3. If the errors e are independent standard normals, and if \[M lo > 
^fn + ^fp then the probability 

P(£ 2 ) < exp f-1 (Vm/ct -V^-Vp) 2 y 

If £2 does not occur, then there are exactly r eigenvalues of X T X larger than 
A r (X T X), so the r largest eigenvalues must correspond to the perturbed values 
of the original. 

We now prove bounds on the angle 9(rpc, Tpfc) between the PC and PFC 
directions, where C(rpc,rpFc) = cot 2 6(rpc, Tpfc) as before. Since M and 
||(X — X) T (X — X)|| arc Op(n), this angle is again of the order Op(l/y/n). 
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Proposition 5.4. In the case d = r, if the errors e are independent standard 
normals, the conditional expectation 



©(Tpc, Tppc) 



£2 ) < arctan 



a 2 y/np 



M — ||(X-X) T (X-X) 



(10) 



where, as in Equation (9), M is the minimum eigenvalue spacing o/X T X. 

Using Proposition 5.4, we continue to explain the simulation graphs given in 
Section 5 of Cook [1], consider the means of the angles 0(rpc,r). Recall that 
Example 2.5, based on Figure 1 of [1], considers normally distributed errors 
e with p = 10, d = r = 1, (3 = 1, and varies parameters n, a and ay. For 
simplicity, we replace some terms by their asymptotic limits to obtain more 
heuristic results. We recall that /3F T F/3 T ~ 0yXn-i) so that asymptotically we 
can take M = na Y - Similarly, we use the asymptotics given by Gcman [G] and 
replace || (X — X) T (X — X)|| by a 2 n. Combining Lemma 5.3 and Proposition 5.4, 
we know that writing C for arctan(oo) (so that C = ir/2 radians, or 90° for the 
graphs plotted) gives 



ie(r PC ,r PFC ) 

'e(f PC ,] +e( 

< Cexp 
+ arctan 




POCi) 



< C exp ( — - (Vn(<7Y I a — 1) — \fp)~ I + arctan 



We ignore the first term (since it decays exponentially fast in n if a Y > c 2 ). 
In the case r = d = 1, angles satisfy a triangle inequality: that is, if the angle 
between vectors Tpfc and r is 9\, and the angle between Tpc and Tppc is O2, 
then the angle between Tpc and r is < 6\ + 62. This means that, conditional 
on 0(rppc, r), the angle between the PC directions and the true T is bounded 
above by 



Ee(r PC , r) < e(r PF c, r) + arctan 



^n(a Y - a 2 



(11) 



where the distribution of C = cot 2 9(rppc, T) is given in Theorem 2.4. In each 
graph in Figure 2, we plot four curves, as follows: 

1. We simulate directly from the distributions given by Theorem 2.4 and plot 
the sample mean of the PFC angle 0(rppc, T) as o, as in Figure 1. 
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>=*=* 



50 100 150 200 250 2 4 6 8 10 

n sigma_Y 



(a) Angle vs n (b) Angle vs cry 




1 2 3 4 5 50 100 150 200 250 

sigma n 



(c) Angle vs <r (d) Angle vs n 

Fig 2. Simulation of angle between T and both Tppc and Tpc, in the setting of Example 2.5. 
We plot the sample mean of the PFC angle 8(rppc,r) as o, the true PC angle 0(rprj,r) 
as X, the bound on the expected angle from Equation (11) as + and the approximate expected 
angle from Equation (12) as A. (a) Angle vs n, with ay = o~ = 1; (b) Angle vs cry, with 
n = 40, o = 1; (c) Angle vs o, with n = 40, cry = 1 (d) Angle vs n, with ay = \/2, a = 1. 



2. Next, we plot the sample mean of the PC angle (based on direct simulation 
of 2500 samples from the underlying distribution) as x . 

3. We plot the bound given by the right hand side of (11) as +, noting that 
it is only useful when a\ > a 2 . 

4. Finally in the spirit of Sibson [13], we use A to plot the angle correspond- 
ing to the leading term in a in the power series expansion in Equation (36) 
of the proof of Proposition 5.4, that is 

e(f PPC ,r) + arctan . (12) 

\Vncrp J 

Providing a more complete description of the distribution of 8(rpc,r) than 
that given in Proposition 5.4, and extending such results to the case d < r, 
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would require more advanced results from random matrix theory. However, the 
results shown in Figure 2 give a good explanation of Figure 1 of [1], and indicate 
parameter regions where the PFC and PC algorithms give close or differing re- 
sults. In particular, the similarity of the plots in Figure 2 given by x and A shows 
that Equation (12) provides an accurate approximation to the PC performance. 
(In fact, (12) should approximately give an upper bound to the performance 
of this algorithm, because of the Jensen inequality used in Equation (37) of 
the proof of Proposition 5.4. However, it appears that the resulting distribution 
is sufficiently concentrated around its mean that the Jensen approximation is 
asymptotically accurate). 

Since the bound given by (11) is only valid where a\ > a 2 , we cannot plot it 
for all ranges of parameters considered by Cook - not at all in Figure 2(a), and 
only in a small region in Figure 2(c). We provide an extra series, Figure 2(d), 
where a\ = 2, to show the dependence of this bound on n. 

Appendix A: Proofs 

Proof of Lemma 2.2. As in [1], we write l ra for a n x 1 matrix of Is, so that if X 
is a n x p matrix with rows given by X y then il„l^X gives a matrix with rows 
all equal to the sample mean X. The assumption that J2 y = means that 
l£F = 0, which implies that P F 1„ = F(F T F) -1 F T 1„ = - that is, |l„l£ 
and Pf represent projections onto orthogonal subspaces. Hence the matrix X 
defined to have rows (X y — X) T can be expressed as (I„ — il n l^)X or, in terms 
of the quantities in Equation (1), as 

X = (i„ - (i»m t + F/3 T r T + e) = F/3 T r T + (i n - e. 

(13) 

This means that the fitted matrix of predictors 

X = Ff3 T T T + P F E = F (/3 T r T + (F T F)" 1 F T E^) . (14) 

Using Equation (14), we rewrite X = FN where N = {f3 T T T + (F T F)~ 1 F T E), 
so that we can define V T = (F T F)~ 1//2 F T X = (F T F) 1 / 2 N, and 

X T X = N T F T FN = ((F T F) 1/2 N) T (F T F) 1/2 N) = VV T . (15) 

Any vector w orthogonal to the span of the columns of V is a O-eigenvector of 
X T X, since for such a vector Equation (15) shows that 

X T Xw = VV T w = VO = 0, 

and similarly the span of the columns of V is preserved by X T X. Hence the r 
columns of V span the same space as the eigenvectors of X T X with the r largest 
eigenvalues. □ 
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Proof of Lemma 2.3. For any matrix A, the entries of AE have mean E(AE) = 
AE(E) = oAE(e) = 0, allowing us to deduce the mean of V. Similarly, we use 
the well-known fact that for any A and B: 

Cov ((AE)y, (BE)fcO = A ir B fc8 Cov (E ri ,E si ) = «7 2 (AB T ) tt %. (16) 

Taking A = B = (F T F) -1 / 2 F T means that AB T = I, so by Equation (16) the 
entries of V satisfy 

Cov (Vy, V W ) = Cov ((F T F)- 1 / 2 F T E) lJ , (F T F)^ 1 / 2 F T E) fei ) = a 2 5. lk 5 oU 

and the result follows. □ 

Proof of Theorem 2.4- In the case d = r, Lemma 2.2 tells us that we can choose 
to take a PFC estimate Tpfc given by an orthogonal transformation of the 
columns of V defined in Equation (3). Write b^, where i = 1, . . . , d, for the 
columns of T, which form an orthonormal set since, by assumption, r T T = 1^. 
We extend this to create an orthonormal basis {t/ 1 ', . . . , t>( p )} for the whole of 
M p , and write G for the p x p matrix made up of the complete set of columns 
bW, with G T G = I P . 

We express V in this new basis, for k = 1, . . . , r we expand the kth column 
of V as: 

p 

= J2 A *fcb W , where A lk = (b«) T V« . (17) 

i=l 

Equivalently, we write the p x r matrix A = G T V = + C. Here /j, = 
G T r/3(F T F) 1 / 2 , consists of a dx r block of the form /3(F T F) 1 / 2 and a (p-d) x r 
zero block, and C = G T E T F(F T F)~ 1 / 2 . Equation (17) and Lemma 2.3 show 
that A has mean /x, so that for i > d + 1 the A^ has mean 0, whereas for i < d 
the A^ has mean 

Mjfc = (f3(F T F)^) tk . (18) 

Using Lemma 2.3, we know that for any 1 < i, j < p and 1 < k, I < r: 

v v 

Cov(A ife ,A j7 ) = ^^(b«) u (bW) w Cov(V ufc ,V w ) 

U—l V — 1 

p p 

= ^DD(b (i) )«(b W )^ w fe=^fc/%. (19) 

u—l v—1 

The fact that T T r = Id implies that the projection Pr = rr T , so that the 
projections of the fcth column become 

d 

p r v< t: » = rr T v( fc ) = ^A lfc b«, 

»=i 

(i-Pr)vW = (i p -rr T )vW = J2 A * b(i) - 

i=d+l 
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Hence, the respective Frobcnius norms become 



l|PrV||| = J2 

k=l 

|(I P -Pr)V||| = J2 

k=l 



i=l 



ik ' 



(i) 



i=d+l 



££ A * 

k=l i=l 
r p 

= £ £ A ? 

fc=l i=d+l 



(20) 
(21) 



Note that the arguments so far do not require the assumption that the errors 
e are normal. Under this additional assumption, we deduce that the A^. are 
normal and independent, with common variance a 2 . In particular, under this 
assumption the expressions (20) and (21) are independent of each other. 

Further, if the errors are normal, Equation (21) means that || (I — Pr)V|||,/<7 2 
has a central \ 2 distribution with r(p—d) degrees of freedom. Similarly, Equation 
(20) means that ||PrV|||,/CT 2 has a non-central x 2 distribution with rd degrees 
of freedom, and non-centrality parameter 



A 



1 



r d 

EEm 

k=l i=l 



1 



ik 



tr (/3(F T F)/3 T ), 



(22) 



□ 



and the proof is complete. 

Proof of Lemma 3.1. We write = A.^ — fi ik , and use Equations (18)-(21). 
In particular, Cik have mean zero and variance <r 2 , and Equation (18) implies 
that J2k,, Z-4 = tr (/3( fTf )/3 T )- He nce we know that N = Ylk=\ Ei=i A ik has 
mean tr (/3(F T F)/3 T ) + rda 2 and D = Y,k=i Ei=d+i A ik has mean r (P ~ d )° 2 ■ 
For i < d, we write C 4fe = ^ r s r n E sr W s |. where W = F(F T F)^ 1 / 2 , and 
M = rr T . We deduce that Var (JV) equals 




4 £ fx ik EC ik C 



d 2 a A +4tr(/3(F J F)/3 J ) ( x 2 . (23) 



In this expansion, we know that ECifcC 2 fc = 0, since it can be written as a 
linear combination of expectations of terms in Ejj of degree 3. Each such term 
contains a term of odd degree, so independence and symmetry means the re- 
sulting expectation is zero. Similarly, using independence and symmetry, since 
only terms of the form EE|- and EEyEjy make a non-zero contribution, for any 
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k we can expand 

= E ( ]T W sfe W, fc E sr E t11 M ru j 

= a 4 W^M^(m 4 - 3) + a 4 ]T W^W^M^M™ + 2M ru M ur ) 

r,s r,s,t,u 

< W ]T W^W^M rr M„„ + (?7i 4 - l)M ru M„ r j 

\r,s,t,u / 

= a 4 (d 2 + (m 4 - l)d) 

since £ s W2 fc = £ s W£ s W sfe = 1, the £ r M rr = tr (IT T ) = tr (r T r) = d, 
and E u M ru M ur = M rr , so the result follows. Similarly we deduce Var (D) < 
(p - d)T. □ 

If e arc independent with mean zero and finite 4th moment, but no longer 
symmetric, we can prove similar results. In this case, the Ei j=i t^ik^^ikC^ 
term becomes Or{\fn), which allows this proof to be adapted. 

Fulton [5] reviews many facts concerning the eigenvalues of sums of matrices, 
including the following result: 

Lemma A.l (Wcyl [14]). For any real symmetric B and L: 

A,(B) + A P (L) < A 4 (B + L) < A i (B)+A 1 (L) for i = 1,2, ... ,p. 

Proof of Lemma 3.2. Since F has rows which correspond to independent sam- 
ples of Y, the Law of Large Numbers means that (F t F)/tt, converges element- 
wise almost surely to a real symmetric matrix Hence, using the union bound, 



SUPj 



(f3(F T F)r3 1 ). ll /n-^ l 



can be made arbitrarily small for all n suffi- 
ciently large with probability 1 — e. This implies that the spectral norm of the 
difference ||(/3(F t F)/3 t /ti-*|| can be made smaller than 8 with the same prob- 
ability. By Lemma A.l, this implies that the ith eigenvalue Ai(/3(F T F)/3 T /n) is 
arbitrarily close to 4>i = Ai(*&). □ 

Proof of Theorem 3.3. We write C(rppc, T) = N/D, in the notation of Lemma 
3.1 and define the event 

= Ik, < tr (^ FTF ^ T ) < K 2 for all n > n*\ . 



By Lemma 3.2, given any a there exist Ki,K2,n* such that P(i?Ki,_fs" 2 ,n*) ^ 
1 — a/3. Note that on Bk^,k^,ti* , Lemma 3.1 implies that the mean E7V > 
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Kin and Var N < Td + AK 2 n. We use Chcbyshev and choose 7SPJL = Kin 
y/3(Td + 4K 2 n)/a to write that 



We bound an upper confidence interval by taking X + = (3r(p — d)a 2 /a) and 
combining the fact that D is independent of the event Bx ll K 2 ,n* with Markov's 
inequality to obtain 

p(e(f PFC) r) > e;) 

= P(e(f P FC,r) > &* + \B Kl ,K 2 ,n'MB Kl ,K 2 ,n*) 

+p(e(f PFC ,r) > ®* + \B c KuK ^ n ,)F{B c KuK ^) 

< F(D/N > tan 2 (e* + )\B KuK2tn .)+F(B c KuK2tn .) 

< P (D/N > tzn 2 (e* + ),D< X+\B KltKatn .) 

+P (D/N > tan 2 (9;),D > X + \B KliK2>n .) + a/3 

< F(N < X + /tan 2 (9* + )\B KuK2tn *)+F(D > X+\B KliKain .) + a/3 

< P(N < X + /tan 2 (e;)) + 2a/3. (25) 

We obtain the required bounds by equating N+ and X + / tan 2 (9?j_), and using 
Equation (24) and the fact that arctan (t) < t. That is, we can take 



01(a) = arctan ( , / ; X+ = ) , (26) 

+ VV Km - ^/3(T + 4K 2 n)/a) ' 

and substitute the value X + = (3r(p — d)a 2 /a) given above. We can find a lower 
confidence interval, since for any X_, a similar argument to Equation (25) gives 

P(0(f PFC ,r) < 91) < P(D <X-)+F(N> X_/tan 2 (01)) +P(fl^ 1 , Jfa ,„.)- 



By Chebyshev and Lemma 3.1, we can take X_ = a 2 r(p — d) — y/ 3T(p — d) /a 
to ensure that F(D < X-) < a/3, and N* = K 2 n + rda 2 + ^3(Td + 4K 2 n)/a 
to ensure that F(N > Nt) < a/3. Again, equating iV* and X-j tan 2 (91), we 
deduce that 



91 (a) = arctan ( , / — = | , (27) 

\Y K 2 n + rda 2 + V /3(T + 4K 2 n)/a J 

and the result follows in the same way. □ 

Proof of Theorem 4-3- During this proof, for simplicity, we write Xi for Ai(UU T ), 
which equals A 4 (/3(F T F)/3 T ) if % < d and zero otherwise. First, we introduce the 
event 

C#,„* = I A,(#) — 5 < — < A,(#) + S for alH = 1, . . . , d, and all n > n* \ , 
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where S is defined in terms of the intcrlevcl spacing as 6 = minj<d(Aj(«fr) — 
Aj + i(*))/10, writing Ad+i(<&) = 0. Note that S > by assumption. 

We mirror the proof of Theorem 3.3 by conditioning on whether C* in » and 
C\ occur, to obtain 

p(e(f PFC ,r) > e*) 
= p(e(fpFo,r)>e*|c^ )B .)P(c^, n O 

+p(e(f PFC ,r) > e*|A,c*, n .)P(A|c»,n') ]p (c*,n«) 
+p(e(f PFC ,r) > e*|£;,c $ ,„.)p(£;,c f ,„.) 

< P(Q i „,)+P(A|C*,nO+P(sme(fp FC) r) > sin e*| a, <?*,„,)• (28) 

Lemma 3.2 means that the first term in Equation (28) is less than a/3 for n* 
sufficiently large. Next we bound the second term in Equation (28), using the 
fact that Ai(L) - A P (L) < 4<T A /||SS T ||||UU r || + o- 2 j|SS T ||, where we write || • || 
for the spectral norm. Completing the square and using Equation (8), we deduce 
that 



P(A|C*,nO = P(Ai(L)-A„( L ) > MC*, n >) 
< 



4 CT ,/||SS^| 



|UU T I 



T 2 ||SS T || >A rf 



cu 



\SS T \\ > ^4X 1 + X d - 



\ss T \\ >V^- + Vp- 

< exp(-u 2 /2) 



Ca 



where u = (^/4Xl + Xd — 2^/Xi)/a — ■sjr — ■ s fp. The last inequality uses the fact 
that S is a p x r matrix of independent standard Gaussians, so we can apply 
Equation (8). Conditioning on the event C* jTl * ensures that 



u > V^4Ai(*) + XdW) - 5(S-2VAi(*) + S)/v-y/r-jp = ay/n-y/r-y/p, 

where the choice of 5 ensures that a > 0, so the second term in Equation (28) 
is less than a/3 for n sufficiently large. 

Finally we condition on the event C\ n C$.„» The key result is (as in [13]) 
that if ei is a Xi -eigenvector of UU T then if satisfies 



(UU T - A,I P ) + (L - Mp)(e* + fi), 



(29) 



then ei + fj is a (A, + /^-eigenvector of X T X = UU T + L, with _L e,. Hence, 
Equation (29) tells us that, conditioned on these perturbed eigenvectors 
form the PFC directions. Since we condition on the event C* jn * , the norm 



(UU T - A,-I p )+|| < 



— 77 7-T < t-Hl 

min^j (Aj — Xi) no 



(30) 
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We need to probabilistically bound the norm of (L — /i;I p ). Lemma A.l implies 
that A P (L) < /Xj < Ai(L), so that for all i the norm 



|(L - < Ai(L) - A P (L) < 4a v /||SS r ||||UU r || + a 2 ||SS T 



(31) 



Using Proposition 4.1, we know that since S is a p x r matrix of standard 
Gaussians, there exists K\ such that P(||SS T || > K±) < a/3. Since we condition 
on the event C# ira *, the ||UU T || < (Ai(3>) + 5)n = K^n for n sufficiently large. 
Overall, we deduce that 



|L- M ilp)|| >ia^KJ<: 



2n 



<o/3. 



(32) 



Taking K = 10(4a^K 1 K 2 + a 2 Ki)/5 and using Equations (29) and (30), we 
deduce that , since sin 6 (Tpfc , T ) represents a weighted average of sines of angles 
between PFC and unperturbed eigenvectors, it is dominated by the maximum 
sine of such an angle. In other words 



;sme(f PFC ,r) > k/v^ £S,c»,».) 



< 



< 



(II (uu ] 



max || L — jUjlpH > 



A.;I P ) + (L - Hilp) 
y/riSK 



> K 



£\1 C#,Tl* ^ 



10 



|L- Mi Ip)ll > (W^i^a + ^Kx)^ L\, C #>n .) 
< p(||L- Hil p )\\ >4a^/Kj{^ + a 2 K 1 \c c 1 ,C^, n ,) < a/3, (33) 



by Equation (32). Taking sin 6* = K/^Jn, Equation (33) tells us that the third 
term of Equation (28) is less than a/3. The result follows using the fact that 
9* = arcsin(AVv^) = 0{\/y/n). □ 

Proof of Lemma 5.1. The key is to notice that using Equations (13) and (14), 
and writing Pg = (l n — il„l^ — Pf) for the projection onto the space or- 
thogonal to the columns of F and to the column made up of Is, we can express 



X-X = 



1 T 
T 1 1 1 

n 



E, 



(34) 



which is not a function of T. As before, since Pf and Pg represent projection 
onto orthogonal spaces, the PfPg = 0- This means that 



X T (X -3 

so, since the cross-term vanishes, 



[ T P F P T 



G-' 



= 0, 



X T X 



(X + (X - X)) T (X + (X- X)) = X T X + (X - X) T (X - X). (35) 



Further, by Equation (16), the relation PpP G = means that the terms X 
and X — X are uncorrelated. Hence, if the errors e are normally distributed, the 
terms X and X — X are independent and the lemma follows. □ 
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Proof of Lemma 5. 3. Note that Equation (34) means that (X-X) T (X-X) = 
(t 2 X t PgX, where X is an nX p Wishart matrix with standard Gaussian entries 
independent of X T X. We take t = VM/a — (^/n + y/p) and apply the result of 
Davidson and Szarek, Equation (8). 

Since (X - X) T (X - X) is positive-definite, A P ((X - X) T (X - X)) > 0, and 
since A^(X T X) = for i > r + 1, Lemma A.l and Equation (35) imply that 

A. t (X T X) < A 4 (X T X) < Ai(X T X) + Ai((X - X) T (X - X)) 

Hence, if £2 does not occur, that is if Ai((X— X) T (X— X)) < M, there are exactly 
r eigenvalues of X T X larger than A r (X T X), and the r largest eigenvalues must 
correspond to the perturbed values of the original. □ 

Next we prove another technical lemma: 

Lemma A. 2. Given n x n projection matrix Pg and n x p matrix X with 
independent standard Gaussian entries, consider matrices R = XW and S = 
XV, where W T V = 0. Then 

E (tr (RR T P G SS T P G )) = tr (W T W)tr (V T V)rank (G). 

Proof. Using the representation R = XW and Equation (16) we deduce that 
E(RR T )y = Sijtr (W T W). Similarly E(SS T ) i3 = ^tr (V T V). The fact that 
W T V = makes R and S independent by Equation(16), so that: 

n n n n 

E(tr(RR T P G SS T P G )) = ^^^^E(RR T > y (P G ), fc E(SS) i ,,(P G ) H 

t=l j=l k=l 1=1 

n n 

= tr (W T W)tr (V T V)^^(P G ) tfe (P G ) fet 

t=i fe=i 

= tr (W T W)tr (V T V)tr (P G P G ) 

as required. □ 

Proof of Proposition 5.4- During this proof, for simplicity, we write \i = A,(X T X) 
and L = (X — X) T (X — X), where Equation (34) means that as before L = 
cr 2 X T P G X, with P G = (I - - Pf)- We condition on the r PFC direc- 

tions being u^, where is a A 2 : -eigenvector of X T X. As in Equation (29) and 
[13], if the vectors v< satisfy 

(l p + (X T X - XiI p )+(L - fi, t I P )) v, = -(X T X - A i I p )+Lu i . (36) 

then u.i+Vi are (Ai+/^i)-eigenvectors of X T X, and give the r PC directions. This 
definition of Vj makes Vj _L u^, so that P Ui rpc,i = and (l p — P Ui )rpc,i = v i- 
As before, we know that for all i: 

||I P + (X T X-AJ P )+(L- Mi I p )|| > 1- ||(X T X- AT P )+|| x ||(L- Mi I P )|| 

IILII 

M 
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This means that, conditioned on £2, 



< 



< 



|(X T X-A I I P ) + Lu I p 



|I p + (X r X-AT p )+(L- M J p )|| 2 
(X^-A^+Luil 2 



(1-|NW 

Writing U for the matrix made up of columns Uj, using Jensen's inequality 



tan 2 (E(e(fpc,U)|£ 2 )) < E (tan 2 6(f PC , U)| £5 



(37) 



E 



Eil(i P -Pu)r PC>i | 



< E 



EJPur PC , 4 | 2 
EiKip-PuJfpc,, 



Co 



< 



EJPu.rpc.d 2 

E.E(((X T X-AJ p )+Lu a ) 2 |£ 2 ) 
(l-HLH/MfEJu.^ 



(38) 



Using Lemma A. 2 and recalling the fact that L = <j 2 X t PgX, we know that 
the numerator of Equation (38) satisfies 

cr 4 Euf X T P G X(X T X - A J p )+(i T i - A i I p )+X T P G Xu 4 

i 

= a 4 Y, E (tr (Sf PcRiRf P G S0) 

i 

= a 4 ^E(tr(R,RfP G S l SfP G )) 

i 

= a 4 rank (G) ^ tr ((^ T X - AJ P )+ 2 ) | Ui | 2 

i 

<r 4 np ^ 



< 



h\ 2 , 



where Ri = X(X T X — \H p ) + : Sj = Xu^, so the conditions of Lemma A. 2 apply. 
Substituting into Equation (38), we deduce the result. □ 
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